ABSTRACT

The purpose of this research is to demonstrate that a
systematic approach to the graphical analysis of Rasch model
residuals can lead to an increased understanding of ordered response
data, that residual patterns change in predictable ways, and
that summary statistics need not be the only piece of evidence for
assuring the fit between model and data. Three simple, idealized
simulations and then two sets of real data are considered. The
research concludes that (1) the measurement error uncovered in the
residual analyses was not noticeable in the examination of person and
item estimates, nor in the person and item fit statistics; and (2) the
tailored residuals provided a specific frame of reference within
which the observed variation could be understood. (PN)

Or The Simulation and Analysis of Measurement Model Residuals
A presentation prepared for Educational Testing Service 
by Larry H. Ludlow, Boston College
January 11, 1984 

I became attracted to the analysis of measurement model residuals, specifically those of
the Rasch models, because it was apparent that the practical techniques commonly applied
to regression, ANOVA, factor analytic, and practically any other statistical model
residuals were not being applied in the area of measurement.

Techniques for the inspection of residuals had been proposed, but there was no
boundary-defining graphical investigation of residual patterns for data which fit the
models, nor for deviations from the model when specific forms of data misfit are
encountered. There was nothing like Draper & Smith to turn to.

Whenever an analysis of fit was discussed it was usually in terms of summary fit
statistics and, as summary statistics, they do not usually provide the detailed
interaction information that I am interested in when I analyze a set of data. Thus, the
purpose of this research was to demonstrate that a systematic approach to the graphical
analysis of Rasch model residuals can lead to an increased understanding of ordered
response data, that residual patterns do alter in predictable ways, and that summary
statistics need not be our only piece of evidence for assuring the fit between
model and data.

First, we will consider three simple, idealized simulations and then look at two sets
of real data.

To reveal deviation from a model requires a background against which deviations
are apparent. A background can be provided through the analysis of residuals from data
simulated to fit the model. This research concentrated on two approaches to generating
simulation data. In the first, item and person parameters are sampled from specified
distributions and then data are generated to fit the model, given those parameters. I
refer to these as "random" simulations. This method is useful for exploring the effect
that certain test characteristics have upon the distribution of residuals.

In the second approach, I begin with the analysis of observed data. Then, the
item and person estimates from those data are used as the simulation parameters to
generate data to fit the model. These are referred to as "tailored" simulations.
Residuals produced by this method provide the relevant framework for revealing
deviations from the model in observed data. This is because identical test
characteristics should produce residuals which behave similarly if the observed data fit
the model. The random simulations, with generally relevant parameter distributions,
establish the broad background for what might be expected. The tailored simulations,
with the observed estimates as parameters, focus on the observed data and define its
particular baseline.
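
The two approaches can be sketched in code. What follows is a minimal illustration, not the program used in this research. It assumes a Rating Scale parameterization with person measure `b`, item calibration `d`, and a shared set of thresholds `taus` (set to zero here purely for illustration). The function names and the placeholder "estimates" are hypothetical.

```python
import math
import random


def category_probs(b, d, taus):
    """Rating Scale model probabilities for scores 0..len(taus)."""
    # Cumulative logits: score x accumulates (b - d - tau_k) for k = 1..x.
    logits = [0.0]
    for tau in taus:
        logits.append(logits[-1] + (b - d - tau))
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def simulate(persons, items, taus, rng):
    """Generate one persons-by-items matrix of responses that fit the model."""
    scores = list(range(len(taus) + 1))
    return [
        [rng.choices(scores, weights=category_probs(b, d, taus))[0] for d in items]
        for b in persons
    ]


rng = random.Random(1984)  # fixed seed so the sketch is repeatable
taus = [0.0, 0.0]          # assumed thresholds for three categories (0, 1, 2)

# "Random" simulation: parameters are sampled from specified distributions.
random_persons = [rng.gauss(0.0, 1.0) for _ in range(100)]
random_items = [rng.uniform(-2.0, 2.0) for _ in range(20)]
random_data = simulate(random_persons, random_items, taus, rng)

# "Tailored" simulation: the generating parameters are the estimates obtained
# from the observed data. These values are placeholders, not real estimates.
estimated_persons = [-0.5, 0.2, 1.1]
estimated_items = [-1.0, 0.0, 1.0]
tailored_data = simulate(estimated_persons, estimated_items, taus, rng)
```

The only difference between the two calls is where the parameters come from, which is the point of the distinction: the generator is the same, and the background it produces is either general or specific to the test at hand.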

The following 3 simulation studies utilized 100 persons, 20 items, 3 response
categories (scored 0, 1, 2), and twenty replications. These simulations use the Rating
Scale model but the results hold for Dichotomous or Partial Credit data.

Study 1 investigates the distribution of residuals under the limiting condition 
that all person measures and item calibrations equal zero. The intent is to reveal 
the pattern of residual variation when the parameters of the model are limited to the 
least variation possible. 

Figure 1 is a residual-by-measure plot for Study 1. Three clusters of residuals
are apparent. If the data had been generated so that all measures and calibrations
received estimates of exactly zero, then the expected score for every person on every
item would be 1. Residuals, in that case, would be just three points on this plot.
Responses of "2" would have a residual of 1, "1" responses would yield 0 valued
residuals, and "0" responses would yield -1 valued residuals. When standardized,
these values would be 1.2, 0.0, and -1.2. Those three values lie in the center of each
of these clusters.
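
The arithmetic behind those three values can be checked directly. This short sketch assumes that, with every measure, calibration, and threshold at zero, the three responses are equally likely (as stated for Study 1), and that a residual is standardized by the modelled standard deviation of the response.

```python
import math

# With every measure, calibration, and (assumed) threshold equal to zero,
# the three responses 0, 1, 2 are equally likely.
probs = [1 / 3, 1 / 3, 1 / 3]

# Expected score and modelled variance of a single response.
expected = sum(x * p for x, p in enumerate(probs))
variance = sum((x - expected) ** 2 * p for x, p in enumerate(probs))

# Standardized residual for each possible response.
z = [(x - expected) / math.sqrt(variance) for x in range(3)]
z_rounded = [round(v, 1) for v in z]  # approximately -1.2, 0.0, 1.2
```

The expected score is 1 and the variance is 2/3, so a raw residual of 1 becomes 1 divided by the square root of 2/3, or about 1.2.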

The three clusters, and not just 3 points, result because the generated data
produced estimates that only approximated, rather than equalled, the original generating
parameters of zero. You will also notice a skew to the distribution; this skew will be
discussed in more detail later.

Figure 2 contains a line plot of the residuals for three items. The line plot
is simply a frequency distribution. The means and standard deviations of the residuals
for each item are as expected under the model (0,1). There is a slight skew to each
distribution (also seen in Figure 1) but the proportion of residuals falling around
each of the three most likely values (-1.2, 0.0, 1.2) is about .33. This relation
holds when data fit the model and all measures and calibrations are identical. When
the estimates are identical each of the responses is equally likely.

These two figures illustrate that the lower limit on the number of residuals
possible on one item is determined by the number of response alternatives. The only way
that residuals can distribute in a continuous pattern is when the people and items
are distributed in their estimates. Restrictions in the spread of measures or
calibrations result in clusters of residuals.

In Study 2 the item calibrations are sampled randomly from a uniform distribution
with a range of four logits and a mean of zero. The person parameters are sampled
randomly from a normal distribution with a mean of zero and standard deviation of one.
This simulation investigates residuals from a testing situation in which the people are
centered on the instrument while the range in the sets of parameters is nearly identical.

Figure 3 is the residual-by-measure plot for Study 2. There is what I refer to
as a "structural skew" for the distribution of residuals (Z's) and it is smooth in
contour as the logits stretch from about -2.5 to 2.5. Unlike Figure 1, there is no
apparent clustering of the residuals. This is because the range in measures and
calibrations produces a nearly continuous distribution of expected scores, on which the
distribution of residuals depends.

The Z max and Z min boundary lines serve as guides for revealing the asymptotic
nature of the residuals as the person measures and item calibrations diverge from one
another. As the measures increase, large +Z's become impossible. As the measures
decrease, large -Z's become impossible. But as the measures become more extreme,
surprising responses produce ever greater residuals. This type of plot can be
constructed for items and the pattern is exactly opposite of what we see here. You can
even build a 3D plot with measures, calibrations and Z's as the axes. I built a
cardboard 3D model and while I found it fascinating I've yet to discover its
practical value.
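
The behaviour of the boundary lines can be illustrated for a single person-item pair. This sketch assumes a Rating Scale parameterization with zero thresholds and takes Z max to be the standardized residual of the top score and Z min that of the bottom score. The names `moments` and `z_bounds` are hypothetical, and the boundary lines in Figure 3 may have been constructed differently.

```python
import math


def moments(b, d, taus):
    """Expected score and variance under an assumed Rating Scale model."""
    logits = [0.0]
    for tau in taus:
        logits.append(logits[-1] + (b - d - tau))
    top = max(logits)
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    probs = [w / total for w in weights]
    expected = sum(x * p for x, p in enumerate(probs))
    variance = sum((x - expected) ** 2 * p for x, p in enumerate(probs))
    return expected, variance


def z_bounds(b, d, taus):
    """Largest and smallest standardized residual possible for one pair."""
    expected, variance = moments(b, d, taus)
    sd = math.sqrt(variance)  # becomes tiny when b and d are very far apart
    top_score = len(taus)
    return (top_score - expected) / sd, (0 - expected) / sd


taus = [0.0, 0.0]  # assumed thresholds, three categories
for b in (-3.0, 0.0, 3.0):
    z_max, z_min = z_bounds(b, 0.0, taus)
    print(f"measure {b:+.1f}: Z max {z_max:6.2f}, Z min {z_min:6.2f}")
```

For a high measure the upper bound shrinks toward zero while the lower bound grows in magnitude, and the reverse holds for a low measure, which is the asymmetry the boundary lines trace.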

Figure 4 contains the line plots for the easiest, most neutral, and hardest items.
Skewness and kurtosis effects are most evident in the two extreme items. The neutral
item fits a Gaussian distribution nearly perfectly. It is apparent that residuals
should not be assumed to fit a normal distribution unless, perhaps, the item is
centered on the people, which is the case for the neutral item. When an item is not
centered on the people, surprising failures and successes will lead to a skewed and
peaked distribution of residuals.

In general there is a positive linear relation between item calibration and the
skewness of the residuals. On very easy items able persons respond mostly as
expected, contributing small residuals bunched around zero with an occasional large
negative residual. On very hard items less able persons respond mostly as expected,
contributing small residuals bunched around zero with an occasional large positive
residual.

Also, in general, there is a quadratic relation between item calibration and the
kurtosis of the residuals. This relation is due to the tendency of residuals to
cluster near zero as item calibrations diverge from the mean person measure. Particularly
for extreme items, the residuals cluster and form peaked and skewed distributions.

Continuing with our inspection of the distributional properties of the residuals
we turn to Figures 5-7. These figures contain rankit plots of the residuals. Figure
5 is a rankit plot for the hardest item (d=1.85). The Z's are ordered and these
observed order statistics are plotted against their expected statistics. Here we see
there are too many large positive residuals, a clustering on the negative side of zero,
and too few large negative residuals. Figure 6 is the rankit plot for the neutral
item (d=.09). It is as nearly "normal" as one is likely to see.
Figure 7 is a rankit plot of the easiest item (d=-1.53). We note too few
positive residuals, a clustering on the positive side of zero, and too many large
negative residuals. These probability plots support our rough interpretation based on
the line plots. Now are these unusual patterns? Only if we are expecting normally
distributed residuals. But since these residuals are from data which do fit the
model, their patterns, not the normal distribution, constitute the standard of
comparison. Again, residuals, even from data which fit the model, cannot be expected
to distribute normally.
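
The construction of a rankit plot can be sketched as follows. The talk does not say how the expected order statistics were computed, so this illustration assumes Blom's approximation, and the residual values are invented for the example. The function name `rankit_pairs` is hypothetical.

```python
from statistics import NormalDist


def rankit_pairs(residuals):
    """Pair each ordered residual with an approximate expected normal order statistic."""
    n = len(residuals)
    ordered = sorted(residuals)
    normal = NormalDist()  # standard normal, mean 0 and standard deviation 1
    # Blom's approximation to the expected i-th order statistic (an assumption;
    # the source does not say how its rankits were computed).
    expected = [normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    return list(zip(expected, ordered))


# Hypothetical residuals for one item, for illustration only.
example = [-0.4, -0.3, -0.2, 0.1, 2.6]
pairs = rankit_pairs(example)
```

Plotting the second element of each pair against the first gives the rankit plot. Points that fall on a straight line indicate a roughly normal distribution, and for the tailored approach the comparison plot would come from residuals of simulated data rather than from the straight line itself.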

A line plot of residuals may suggest that residuals are roughly continuous in
their distribution. That is not actually the case. The data observed are categorical
responses; only the expected responses approach continuity. The extent of continuity
in the expected response depends on the variation in the person and item estimates.
In the extreme case where all measures equal all calibrations, there is only one
expected response (as we saw earlier). Most data, however, yield expected responses
which are close to continuous. The consequence of a categorical observed response and
a continuous expected response is an approximately continuous residual distribution
that is composed of discrete "layers" of residuals. Each categorical response
contributes a layer. The residuals in a line plot, therefore, can be separated
according to the number of response possibilities for the item.

Figures 8-10 plot the residuals against the person measures for the same three
items we have been considering. The same residuals are also shown back in Figure 4.
Consider Figure 8, for the hard item; here we see three rather distinct patterns, or
layers, of residuals. The upper-most residuals came from "2" responses. The middle
layer of residuals came from "1" responses and the lower level residuals came from
"0" responses. Now from this we can tell that in Figure 4, the largest +Z can be
identified as having come from a "2" response by a mid-range ability person. We
also can tell that on this hard item most persons gave the "0" response, as expected,
incurring small residuals. Figure 9 shows the pattern for the centered item, which
had a nearly normal distribution of residuals. Here each of the responses is
represented about equally. Figure 10 shows the pattern for the easiest item. Most
persons did as expected by scoring a "2" while the large -Z's are due to "1" responses.
This type of plot is useful for revealing which responses by what range of
ability estimate led to surprising residuals, and how frequently which responses were
being used at various ability levels.

Finally, Study 3 again uses item calibrations sampled from a uniform distribution
with a range of four logits and a mean of zero. But now the person estimates are
sampled randomly from a normal distribution with a mean of one, still keeping a
standard deviation of one. This simulation investigates residuals from a testing
situation in which the mean measure of the sample is greater than the mean
calibration of the instrument while the standard deviation in the sets of estimates is
nearly identical.

Figure 11 is the residual-by-measure plot. The structural skew of the residuals
is evident but is exaggerated in the positive direction and truncated in the
negative direction of the person measure axis. This is because the mean measure of
the people is located one logit above the mean calibration of the items.

The general form of the distribution is identical to that for Study 2 in Figure
3 but here a mistake by an able person produces a residual of greater magnitude than
the residual for a less able person who scores a surprising success. We would get
the opposite pattern, though the same form, if the mean measure were less than the
mean item calibration.

Figure 12 contains the line plots for the easiest, most neutral, and hardest
items. Again, the skew and kurtosis relations are evident, only more so. The large
gap in the distribution for item #2 suggests that the item is operating as if there
were only two rather than three response categories. Scores are either "2", leading
to a small positive residual, or "0", leading to a large negative residual. A "layered"
plot like Figures 8-10 would reveal the true situation. The distribution of residuals
is most nearly symmetric for the neutral item; its difficulty is nearest the mean
measure of the sample, which was 1. The hardest item is only slightly harder than
the mean measure. Hence, the distribution of residuals contains relatively slight
skew and kurtosis effects.

In conclusion, these three idealized simulations reveal how the general
distribution of the residuals is affected by the spread of the person and item estimates,
and the difference between the mean estimates. Other factors influencing the shape
of the distribution include the number of persons, items, and response categories.

These studies illustrate that "unusual" structures are the norm, and that
these structures can be predicted. Furthermore, they illustrate that Rasch model
residuals cannot be assumed to follow a normal distribution, except under very
strict circumstances.

The modelled asymmetry of the residuals affects how observed data residuals
should be interpreted. For interpreting the fit of observed data to the model, it is
not enough to note that an item or person incurred a large residual. Since asymmetry
in the distribution of residuals can occur as a consequence of the model, there must
be evidence that the appearance of large residuals is pattern-disrupting and
unexpected before model misfit can be claimed. Large residuals may be very informative
about an item or person but their appearance does not necessarily mean something is wrong!
Thus, any analysis of observed residual variation can only be understood by comparing
its patterns to those of residuals from data generated to simulate the observed
data as closely as possible. An assumed, hypothetical distribution of person and
item estimates is an inappropriate background for analyzing observed residuals.

Now, two examples that illustrate some of the practical significance of analyzing 
observed data residuals in concert with residuals tailored to the testing situation* 

The first example discusses an instrument constructed to measure attitudes
toward blindness. There are 19 items, 222 persons, and 4 response categories:
strongly agree=3, agree=2, disagree=1, strongly disagree=0 (the higher the score,
the more positive the attitude). Three interviewers collected the data from blind
patients who participated in a blind rehabilitation program at the United States
Veterans Administration Hines Hospital, Hines, Illinois. The instrument was
administered prior to participating, immediately after release, and six months after
release. The data are calibrated with the Rating Scale model and, given the person
and item estimates, multiple sets of tailored data were generated and analyzed.
Some of the items include the following:

1. A blind person can be a superior piano tuner. 

2. Blind people are more honest than sighted people. 

3. Blind workers complain less. 

4. A blind person develops extra senses. 

5. A blind person can raise a normal child.

Since each simulation is one "what if" event, it is, obviously, prudent to 
replicate simulations. Otherwise, one runs the risk of treating a single simulation 
as "truth" and then building an analysis around discrepancies between that single 
case and the observed data. The risk in this strategy is that the single simulation
might not resemble the mean pattern of additional replications. 

The problem is how many should be done, and do you compare multiple plots to one 
another? What I do is generate three sets of tailored data, generate my plots, and 
compare each set separately with the real data. Then see what consensus or differ- 
ences exist between the tailored data sets. Obviously, there is a degree of subject- 
ivity involved. 

Figure 13 plots the residuals from the tailored data against the person measures. 
The circled area highlights a part of the expected pattern that becomes significant 
when compared with the same area for Figure 14. In Figure 14 we notice a relatively 
large number of middle-range attitude patients who have provided surprising disagree
responses. Their low scoring responses, given their relatively high attitude measures,
resulted in large negative residuals. But which items are they, and which patients? 

To understand these residuals in clearer detail, we can plot the residuals in 
item sequence order. Figure 15 plots the tailored residuals and reveals the expected
patterns; Figure 16 reveals the observed residual pattern. We see that most of the
large negative residuals come from the first few items. 

This pattern suggested that some form of "start-up" effect might be influencing 
the measurement process. The next step, therefore, was to construct line plots of
the residuals broken down by time period and interviewer. Figure 17 contains the
expected pattern from the tailored data and Figure 18 contains the pattern for the
observed data. What is revealed is a pattern of large negative residuals from
surprising disagree responses at Time 1.

The pattern for this item is typical for others of the first few items. True,
there aren't many residuals here but we decided to check with the interviewers in
order to uncover anything unusual in their techniques or patients. When the
interviewers were presented with these patterns they explained that most patients did not
respond using the original suggested response categories. Instead they responded
"right", "false", "true", "sometimes", etc. Patients without strong convictions
did not express their attitudes strongly. The interviewers were then required to
interpret those responses. After a few items they usually picked up the patients'
pattern and distinguished between middling and extreme responses. But each
interviewer handled that situation in an idiosyncratic fashion. This "start-up" effect
was a systematic source of measurement error at time period 1. It was partially
remedied by introducing a few "warm-up" items.

Since many line plots can be constructed in an analysis, one simple way to
summarize these line plots is to plot pairs of mean residuals for the items. If only two
groups are created, then such a plot will have a negative slope because the means will
approximately sum to zero. Otherwise they should be scattered about the origin.
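
A minimal sketch of this summary follows. It assumes the residuals are held as one row per person with a parallel list of group labels. The function name `mean_residual_pairs`, the labels, and the numbers are hypothetical and serve only to show the bookkeeping.

```python
from statistics import mean


def mean_residual_pairs(residuals, groups, group_a, group_b):
    """Return one (mean for group_a, mean for group_b) pair per item.

    residuals: rows of standardized residuals, one row per person
    groups:    group label for each row (for example interviewer and time)
    """
    n_items = len(residuals[0])
    pairs = []
    for item in range(n_items):
        a = [row[item] for row, g in zip(residuals, groups) if g == group_a]
        b = [row[item] for row, g in zip(residuals, groups) if g == group_b]
        pairs.append((mean(a), mean(b)))
    return pairs


# Invented residuals for four persons on two items, two persons per group.
residuals = [[0.5, -1.0], [0.1, 0.4], [-0.6, 0.8], [0.2, -0.2]]
groups = ["A", "A", "B", "B"]
pairs = mean_residual_pairs(residuals, groups, "A", "B")
```

Each pair becomes one point on the plot, with one group's mean on each axis, so an item that behaves differently for the two groups shows up as a point far from the rest.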

(The means of standardized residuals are not expected to sum to zero because the 
residuals are not standardized relative to a common error term. According to current 
terminology these residuals are "internally studentized". The transformation of the 
estimated residual into a scaled residual is accomplished by dividing by a standard 
error modelled for each expected response.) 
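
In symbols, and with notation that is an assumption of this note rather than taken from the talk, let x_ni be the response of person n to item i and P_nik the modelled probability of response k (k = 0, ..., m). The scaling just described can then be written as:

```latex
E_{ni} = \sum_{k=0}^{m} k \, P_{nik}, \qquad
W_{ni} = \sum_{k=0}^{m} (k - E_{ni})^{2} \, P_{nik}, \qquad
z_{ni} = \frac{x_{ni} - E_{ni}}{\sqrt{W_{ni}}}
```

Because W_ni differs from one person-item pair to the next, the z_ni for an item do not share a common denominator, which is why their group means need not sum to zero.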
Figures 19 and 20 contain plots of pairs of mean residuals for each item for
Interviewer A (a man) and Interviewer B (a woman). Their means are not expected to sum
to zero because two other means were also computed (Interviewer C at Time 2, Interviewer
B at Time 3). Therefore, the pairs of means may lie in any of the four quadrants. This
pair of interviewers was selected because discussions with the interviewers suggested
that the responses to some questions by some patients were influenced by having to respond
to a woman.

Figure 19 contains the set of tailored residual means. If there is no effect of
interviewer gender on patient response, then there should be no pattern when Interviewer
A means at Time 1 are plotted against Interviewer B means at Time 1. Such a random
pattern is seen in Figure 19.

Figure 20, however, contains three points that stand out from the others. In
Quadrant II the discrepancy in scoring "work" has already been addressed in terms of the
"start-up" effect. The relation between "sex" and "marriage", however, is a new
piece of information. These items read "A blind person can offer their spouse
satisfactory sex" and "Being blind is an asset to marriage". Interviewer B, the woman,
elicited surprising negative responses from some male patients on these two items. Their
negative responses led to negative residuals. A similar configuration resulted when her
means were plotted against the other man at Time 2. The responses she and the two men
interviewers elicited on these two items are different. In particular, these two items
were hard for some patients to agree with when she conducted the interview.

Further investigation revealed that most of these men had been interviewed by the
woman after the patients entered the hospital and that these interviews had been
conducted in their private rooms. The results of this analysis and anecdotal evidence from
rehabilitation staff members (regarding the effect that a young woman and older man
walking off to a private room had upon the general population) led to a change in
interview locale. The multitude of problems in these data suggested that a more global
assessment of misfit might be informative.
Figure 21 contains the first two unrotated principal components extracted from the
inter-item correlations for the tailored residuals. Here we are concerned with the
unidimensionality of the instrument. The location of items should be random and without
substantive meaning. This is what we interpret from Figure 21. We could make no sense
of the configuration.

Figure 22, however, contains the unrotated principal component solution for the
observed residuals. The difference between the first two eigenroots of the tailored and
observed residuals and the shape of this principal component solution both indicate that
a linear structure between the items still remains in the correlation matrix. In the
negative direction of the first component are items which generally concern activities
that a blind person might be able to do as well as or even better than a sighted person.
In the positive direction of the first component are items concerning affective
characteristics blind persons might gain as a consequence of their blindness. These items
question whether a blind person develops positive affective characteristics to a degree
that he would not likely have attained if he were sighted. The presence of these item
clusters means that some patients respond to one group of items differently than the way
they respond to the other.
Items in the negative direction ("activity items") include the following (abbreviated):

1. Can be superior piano tuners

2. Can be good supervisors

3. Can participate in group activities

4. Can be sensitive social workers

5. Can offer spouse satisfactory sex

Items in the positive direction ("affect items") include the following (abbreviated):

1. Can endure boring tasks more easily

2. Are closer to spouse than sighted

3. Blind workers complain less

4. Can understand feelings better than sighted

5. A blind person is an especially loyal friend

This lack of unidimensionality was supported when a separate calibration of these
item clusters was performed. These data were separated into two sets, each composed of
the items in one cluster. Each set was separately calibrated. The pairs of person
attitude measures were plotted. This was done for the observed and tailored data. If
the scale is unidimensional, the pairs should fall along a straight line and be fairly
highly correlated. If the scale is not unidimensional, the pairs might form any type
of pattern.

Figure 23 shows the tailored pattern. Figure 24 shows the observed pattern. As
can be readily seen, an overall estimate of attitude is not an accurate measure for a
substantial number of persons taking this instrument. Two separate scores are now reported.

The second example discusses DIAL (Developmental Indicators for the Assessment of
Learning), an instrument for the screening of gross motor, fine motor, concepts, and
communication skills. The function of DIAL is to identify children in need of follow-up
services. Only the communication skills scale is analyzed here. There are 8 items, 814
children, and as many as 7 performance levels. The children range in age from 24 to 72
months. They live in three regions of the United States and are stratified by sex and
race. The data are calibrated with the Partial Credit model. Three sets of tailored
data were generated and analyzed.

Sample items include the following:

1. Articulation of words

3. Remember number, sequence, sentence

5. Name the action presented

8. Number of words in telling of story

Figure 25 plots the tailored residuals against the child measures. Figure 26
contains the corresponding plot for the observed residuals. The structural skew is
evident in each plot but the tailored pattern contains more large positive residuals
and fewer large negative residuals than does the observed pattern. Since the tailored
residuals come from responses generated to fit the model, these residuals inform us
that occasional surprising successes can be expected under the model. But these
successes are not found in the observed data! This is unusual because we usually expect
the real data to have greater variation than the simulation data. Each of the tailored
data sets gave a similar result: some surprise was expected, but not found.

Given the nature of the administration, and through discussions with the test
developers, we concluded that some administrators let their opinions of some children
influence their scoring objectivity. That is, some administrators did not make a serious
effort to test younger, less able children on hard items (allegedly saving mutual time and
frustration). This placed a ceiling on the child's ability, denying potential extra
credit if all tasks had been offered. This interpretation is supported by Figures 27
and 28, which show that it was the hardest items which were expected to elicit the
surprising successes. Figures 26 and 28 also reveal quite a few high ability children
who performed less well than expected (in the lower right corner).

One reason for the greater number of surprising failures is revealed by examining
Figure 29, a table containing sorted residuals. Most of the large negative residuals
in the earlier figures are contained in this table. Item L3 requires remembering skills.
A series of tasks is presented and a child receives one point for successfully completing
each task. The administrator is supposed to mark off each task completed, not just the
highest level task completed. All the children under L3 in this table are bright
(indicated in the B column), are from the same region (indicated by first digit in ID
field), were administered the instrument by the same person (determined by examining the
protocols), did successfully complete nearly every step (determined by examining the
protocols), but are credited with a very low score. This occurred because the
administrator marked just the highest level completed. A score of "1" was then entered on
a child's data record (at the time of data entry) because only one mark, not three, was
recorded. That type of error was easily corrected.

The other item in Figure 29, L1, uses 15 words to test articulation skills, e.g.,
mouth, sandwich. This is the easiest item for most children to complete. The
identification codes again reveal a commonality among these surprisingly low scoring
children. These children all come from two areas in the southern region of the United
States. It is possible that these children are giving a proper Southern articulation to
words but the administrators do not have the skill to notice that or, perhaps, they are
aware of the accent emphasis and have chosen to score children on a stricter criterion
that they assume is more appropriate. That is, the administrator might decide that
Southern articulation is not as correct as some imagined "ideal" standard. Here, the use
of local scoring personnel who are aware of regional accents and who score responses
relative to the regional standard might help overcome this interaction.

In addition to these specific but relatively minor problems, item L5 ("name action")
serves as an interesting example of residuals which "overfit" the model. The Wright W-P
fit statistic (t = -6.65) indicated that children performed more consistently than
expected on this item. That is, few children scored much more or less than expected, given
their ability level. The more able kids were, generally, the older kids; they had greater
experience. This type of inequity in experience is frequently encountered in "high
discrimination" items. Figure 30 (tailored residuals) and Figure 31 (observed) reveal
what an unexpectedly consistent performance means in terms of a residual pattern. Now
the important feature here is the observation that the standard deviation of the observed
residuals is less than that for the data tailored to fit the model. This narrow,
constricted pattern for the observed residuals is characteristic of items with relatively
large negative fit statistics. And such constricted residual variation leads to high
discrimination indices. The response patterns leading to these residuals are more
consistent than expected and are flagged by both negative fit statistics and high
discrimination values because the residual variation is less than that expected under the
model.
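
The comparison of spreads can be made concrete with a small sketch. It assumes the residuals for one item are available from the observed data and from several tailored data sets. The function name `residual_sd_ratio` and the numbers are hypothetical, and no cut-off for the ratio is implied by the source.

```python
from statistics import pstdev


def residual_sd_ratio(observed, tailored_sets):
    """Ratio of the observed residual SD to the mean SD across tailored sets."""
    tailored_sd = sum(pstdev(t) for t in tailored_sets) / len(tailored_sets)
    return pstdev(observed) / tailored_sd


# Invented residuals for one item, for illustration only.
observed = [-0.5, 0.0, 0.5]
tailored_sets = [[-1.0, 0.0, 1.0], [-1.0, 0.0, 1.0]]
ratio = residual_sd_ratio(observed, tailored_sets)  # below 1: less spread than modelled
```

A ratio noticeably below one corresponds to the constricted pattern described for this item, while a ratio above one would point toward more variation than the model expects.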

In conclusion, the preceding discussion illustrates some of the practical utility
of analyzing observed residual variation relative to tailored, or expected, residual
variation. The measurement error uncovered in the residual analyses was not noticeable
in the examination of person and item estimates, nor the person and item fit statistics.
The tailored residuals, coming from data generated to fit the model given the original
estimates, provided a specific frame of reference within which the observed variation
could be understood. The use of hypothetical distributions to generate data would not
have provided a relevant background for either set of data.

Such an analysis of observed residual variation is expedited when a general
systematic strategy is followed. One strategy found useful in our exploratory research
entails a hierarchical sifting through of the data. In general, coarse techniques are
employed first, and person effects are studied before item effects. Those results direct
the second level of analysis, and so on, until detailed analyses finish checking all leads
suggestive of contributing measurement error. A detailed discussion of a general
systematic strategy (including possible patterns uncovered, their meaning and additional
steps which might be taken) may be found in Ludlow (1983).

My appreciation is extended to those members of the ETS community who participated
in the seminar and offered valuable feedback and interesting suggestions for modifications
to my plots and the creation of others.

Figure 9 -- Standardized residuals versus person measures for a centered item

Figure 15 -- Residuals versus item sequence:
Tailored data, first set

Figure 16 -- Residuals versus item sequence:
Observed data

Figure 18 -- Residuals on 1128, "work", by interviewer and time period

Figure 26 -- Residuals versus person measures:
Observed data

Figure 27 -- Residuals versus item difficulties:
Tailored data, first set

Figure 28 -- Residuals versus item difficulties:
Observed data

Figure 29 -- Sorted, truncated, standardized residuals (z<=-3.0),
for five year old children: Observed data

Figure 30 -- Residuals on L5, "name action", by age group:
Tailored data, first set (d = 1.71)

Figure 31 -- Residuals on L5, "name action", by age group:
Observed data (d = 1.68)